Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for structured parameters #47

Merged
merged 10 commits into from
Jul 23, 2024
Merged

Conversation

klueska
Copy link
Contributor

@klueska klueska commented Apr 29, 2024

This PR adds support for structured parameters to the example DRA driver for Kubernetes v1.30 (removing the support for "classic" DRA in the process). A new branch called classic-dra has been created to preserve support for it.

Closes #44

@k8s-ci-robot k8s-ci-robot requested review from byako and pohly April 29, 2024 12:33
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 29, 2024
@klueska
Copy link
Contributor Author

klueska commented Apr 29, 2024

/cc @pohly @bart0sh @aojea @asm582

@k8s-ci-robot k8s-ci-robot requested a review from bart0sh April 29, 2024 12:34
@k8s-ci-robot
Copy link
Contributor

@klueska: GitHub didn't allow me to request PR reviews from the following users: asm582.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @pohly @bart0sh @aojea @asm582

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested a review from aojea April 29, 2024 12:34
@klueska klueska force-pushed the structured-parameters branch 3 times, most recently from ab745f7 to f848603 Compare April 29, 2024 12:40
README.md Outdated
@@ -117,11 +136,48 @@ Spec:
...
```

**Note**: When running with `StructuredParameters` you can also check that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to publish NodeAllocationState when running with structured parameters?

That's not going to get updated, is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it did get updated. But how? Is that something that the kubelet plugin does?

I can imagine that it was easy to do this because the code was already there, but as an example of how to write a DRA driver I find it confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to make the minimal changes necessary to support both classic DRA and structured parameters in the same code base. It would definitely look different if this weren't a goal. In fact, I would probably move all the state of already-prepared claims out of the CRD and instead checkpoint in a file on the node. Is it worth making this separation in this PR into two separate plugins / controllers that get deployed depending on the withStructuredParameters flag?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make the code more realistic for folks who only want to do structured parameters. I think it's worthwhile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has now been updated to strip out support for classic DRA


klog.Infof("Created ResourceClaimParameters for GpuClaimParameters %s/%s", namespace, gpuClaimParameters.Name)
return nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a non-trivial amount of code. Would it be worth adding unit tests or do we leave that as an exercise to the reader?

The demo works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always worth it to add unit tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to skip unit tests for now since this goes away in 1.31 anyway.

Next, deploy four example apps that demonstrate how `ResourceClaim`s,
`ResourceClaimTemplate`s, and custom `ClaimParameter` objects can be used to
request access to resources in various ways:
```bash
kubectl apply --filename=gpu-test{1,2,3,4}.yaml
kubectl apply --filename=demo/gpu-test{1,2,3,4}.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Showing that parameters got converted when using structured parameters would be useful here. OTOH, it's not very impressive right now. So perhaps not (yet)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not very interesting yet ... I will definitely add this once I open a PR introduce selection via a vendor-specific CRD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will wait to do this as part of the 1.31 update

@klueska klueska force-pushed the structured-parameters branch from f848603 to 1d1e362 Compare July 10, 2024 23:45
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 10, 2024
@klueska klueska force-pushed the structured-parameters branch 3 times, most recently from 19178d5 to b6b1206 Compare July 11, 2024 09:08
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
@klueska klueska force-pushed the structured-parameters branch from b6b1206 to 1e3c484 Compare July 15, 2024 08:24
cmd/dra-example-controller/claimparametersgen.go Outdated Show resolved Hide resolved
}

func (d *driver) NodePrepareResources(ctx context.Context, req *drapbv1.NodePrepareResourcesRequest) (*drapbv1.NodePrepareResourcesResponse, error) {
logger := klog.FromContext(ctx)
logger.Info("NodePrepareResource", "numClaims", len(req.Claims))
klog.Infof("NodePrepareResource is called: number of claims: %d", len(req.Claims))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why do you prefer using klog directly over context logger?

return nil
})

prepared, err := d.state.Prepare(claim.Uid, allocated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like claim preparation logic is implemented it state. I expected most of it here, in the driver.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does that leave for state to do then? Are you suggesting I get rid of the state indirection altogether?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would leave at least checkpointing and probably other state-related operations. The actual device preparation should be decoupled from the state management in my opinion.

Copy link
Contributor Author

@klueska klueska Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about pulling the list of allocatable devices up into the driver -- this is the main thing that the state abstraction maintains, and "preparation" at this level is basically checking if the allocated devices are actually in the allocatable list, checkpointing them, and then returning the set of CDI devices associated with them. Maybe you just don't like the name prepare() for this lower-level operation? Do you have another suggestion on the name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had an impression that all resource preparation code is in the state.prepare. If this is not the case, than I'm ok with this.

@bart0sh
Copy link
Contributor

bart0sh commented Jul 15, 2024

looks good to me except of a couple of nits I've mentioned above.

One thing that I've noticed that the device preparation is handled by the driver state. I'd suggest to move it out of there to the driver. I understand that for the demo driver it can be empty, but it shouldn't be in the state implementation in my opinion.

@klueska klueska force-pushed the structured-parameters branch from 1e3c484 to 49c8d0a Compare July 16, 2024 11:28
metadata:
creationTimestamp: "2024-04-17T13:45:44Z"
generateName: dra-example-driver-cluster-worker-gpu.resource.example.com-
name: dra-example-driver-cluster-worker-gpu.resource.example.comxktph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to separate random suffix from the name, e.g.
dra-example-driver-cluster-worker-gpu.resource.example.comxktph -> dra-example-driver-cluster-worker-gpu.resource.example.com.xktph ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a question for @pohly -- this comes from his library for generating the name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GenerateName: dra-example-driver-cluster-worker-gpu.resource.example.com- already has a trailing -. But as that prefix is long, Kubernetes truncates it and drops the trailing -.

One way to avoid that would be to truncate the <node>-<driver>- string in advance if it is too long, for example by cutting out some characters in the middle of each sub-component. Would that be better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comes from his library for generating the name

Just to be clear: that library produces generateName, not the name - that is the name generated by Kubernetes.

Copy link
Contributor

@byako byako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

AddFunc: func(obj any) {
unstructured, ok := obj.(*unstructured.Unstructured)
if !ok {
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
return

Otherwise unstructured == nil is used on line 109

UpdateFunc: func(oldObj any, newObj any) {
unstructured, ok := newObj.(*unstructured.Unstructured)
if !ok {
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
return

return cdiDevices, nil
}

func (s *DeviceState) Unprepare(claimUID string) error {
s.Lock()
defer s.Unlock()

if s.prepared[claimUID] == nil {
checkpoint := newCheckpoint()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the checkpoint need to be loaded every time unprepare or prepare is called? Is it some sort of protection from kubelet-plugin Pod being offline / suspended temporarily? If not, my impression was that the checkpoint can be loaded just once when the kubelet-plugin object is being created and new state object is created. After state is ready, then the prepared claims need to be saved into the checkpoint every time Prepare or Unprepare is called trusting that State object is the source of truth, and ignoring old data in snapshots.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 23, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: byako, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6e99295 into main Jul 23, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Structured parameter implementation
5 participants